
Storage Cluster Maintenance

Background

As of RHQ 4.12, there are several maintenance tasks or jobs that are performed on the storage cluster or a subset of it. Examples of these maintenance jobs include:

  • Adding a storage node to the cluster (via the deploy process)

  • Removing a storage node from the cluster (via the undeploy process)

  • Running anti-entropy repair

  • Changing heap settings

  • Changing storage cluster credentials (note that only the password can be changed)

The maintenance jobs can be complex, multi-step workflows that involve executing different resource operations across storage nodes as well as performing actions on the server, such as updating the Cassandra schema, modifying StorageNode entities, or changing the shared storage cluster settings. There are other maintenance jobs for which we need to add support.
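
To make the idea of a multi-step work flow concrete, here is a minimal Java sketch of how the "add a storage node" job might be decomposed into discrete steps. The MaintenanceStep interface, the step() helper, and the step descriptions are all hypothetical; they are not the actual RHQ types or plugin operations.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch only: the MaintenanceStep interface, the step() helper,
// and the step names are illustrative, not actual RHQ types or operations.
public class DeployWorkflowSketch {

    /** One unit of work: a resource operation on a node or a server-side action. */
    interface MaintenanceStep {
        String description();
        void execute();
    }

    static MaintenanceStep step(String description, Runnable action) {
        return new MaintenanceStep() {
            public String description() { return description; }
            public void execute() { action.run(); }
        };
    }

    public static void main(String[] args) {
        // Roughly the shape of the "add a storage node" job described above.
        List<MaintenanceStep> deployNewNode = Arrays.asList(
            step("announce the new node to the existing cluster members", () -> { }),
            step("prepare and bootstrap the new node", () -> { }),
            step("update the Cassandra schema", () -> { }),
            step("run anti-entropy repair on the affected nodes", () -> { }),
            step("update StorageNode entities and shared cluster settings", () -> { })
        );

        for (MaintenanceStep s : deployNewNode) {
            System.out.println("executing: " + s.description());
            s.execute();
        }
    }
}
```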

Current Design and Implementation

We need to briefly discuss the current design and implementation in order to understand the problems they introduce and the changes that will remedy them. The discussion will primarily focus on the details of adding nodes, removing nodes, and running anti-entropy repair, as those maintenance jobs all share the same design and implementation. The server-side functionality is primarily implemented in the StorageNodeManagerBean and StorageNodeOperationsHandlerBean EJBs, and the agent-side functionality is primarily found in StorageNodeComponent in the rhq-storage plugin.

Take a look at this diagram to see the workflow involved in going from a one-node to a two-node cluster. The flow of execution starts in the upper left, moving to the right. When the end of a row is reached, execution wraps around to the next row, starting again on the far left. In the first row, when the server thread T1 schedules the announce operation, it becomes free to execute other tasks. It does not block waiting for a response from the agent, nor does it block waiting for a state change in the database. The server resumes execution when notified by the agent. As the diagram shows, it can be a different server thread that resumes execution. In fact, the agent could send the operation result to a different server, and that server would continue executing the workflow steps.

Lastly, note that after the agent finishes a resource operation, the server updates the StorageNode.operationMode property, which is used for tracking the state and progress of the workflow. This is discussed in more detail in the following sections.
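
The following is a minimal sketch of this non-blocking pattern, with made-up names throughout: scheduleOperation, onOperationResult, and the NodeState enum standing in for StorageNode.operationMode. It only illustrates how a result callback can advance the work flow based on tracked state; it is not the actual server or agent code.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch of the non-blocking work flow pattern described above.
public class AsyncWorkflowSketch {

    enum NodeState { ANNOUNCE, BOOTSTRAP, REPAIR, NORMAL }  // stand-in for StorageNode.operationMode

    static final ExecutorService AGENT = Executors.newSingleThreadExecutor();

    // The scheduling thread (T1) submits the operation and is immediately free
    // for other work; it never blocks waiting on the agent or the database.
    static CompletableFuture<String> scheduleOperation(String node, String operation) {
        System.out.println("scheduling " + operation + " on " + node);
        return CompletableFuture.supplyAsync(() -> operation + " finished on " + node, AGENT);
    }

    // Invoked when the agent reports the operation result, possibly on a
    // different thread (or a different server) than the one that scheduled it.
    static void onOperationResult(String node, NodeState completedPhase, String result) {
        System.out.println("result received: " + result);
        switch (completedPhase) {
            case ANNOUNCE:
                scheduleOperation(node, "bootstrap")
                    .thenAccept(r -> onOperationResult(node, NodeState.BOOTSTRAP, r));
                break;
            case BOOTSTRAP:
                scheduleOperation(node, "repair")
                    .thenAccept(r -> onOperationResult(node, NodeState.REPAIR, r));
                break;
            default:
                System.out.println(node + " work flow complete");
                AGENT.shutdown();
        }
    }

    public static void main(String[] args) {
        scheduleOperation("node1", "announce")
            .thenAccept(r -> onOperationResult("node1", NodeState.ANNOUNCE, r));
    }
}
```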

Problems with Current Design and Implementation

There are a number of issues that make managing the storage cluster more difficult than it ought to be. Several related bugs have been grouped under an umbrella bug, BZ 1120418. The problems highlighted and discussed here cover these bugs.

Fault Tolerance

The current implementation lacks fault tolerance. Any failure, even what would constitute a partial one, will be treated as a total failure. Suppose we have a three node cluster with node0, node1, and node2, and we want to deploy a fourth node, node3. And let's say that node2 and its agent are down. The deployment will fail because we cannot execute the announce operation on node2. Cassandra is perfectly capable of dealing with down nodes. A new node can still bootstrap into the cluster. When the down node comes back up, it will learn that the cluster topology has changed and make the necessary adjustments. Knowing this, we should be able to work around the fact that node2 is down and proceed with the deployment of node3.
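
Below is a hedged sketch of how a down node could be treated as a partial failure during the announce phase. The announce() method is a placeholder that simulates node2's agent being down; the point is simply that the deployment can record the node for later maintenance and continue.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: treat a down node as a partial failure during announce
// instead of failing the whole deployment.
public class PartialFailureSketch {

    static boolean announce(String existingNode, String newNode) {
        // Placeholder for the real announce resource operation; here node2
        // simulates a node whose agent is down.
        return !existingNode.equals("node2");
    }

    public static void main(String[] args) {
        List<String> cluster = Arrays.asList("node0", "node1", "node2");
        String newNode = "node3";
        List<String> needsAnnounceLater = new ArrayList<>();

        for (String node : cluster) {
            if (!announce(node, newNode)) {
                // Partial failure: Cassandra tolerates down nodes, so remember
                // this one for later maintenance and keep going.
                needsAnnounceLater.add(node);
            }
        }
        System.out.println("proceeding with bootstrap of " + newNode);
        System.out.println("nodes needing announce when they come back up: " + needsAnnounceLater);
    }
}
```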

Recovery

We currently lack any sort of auto-recovery mechanism. In the case of the deploy or undeploy processes, it is entirely up to the user to retry when there is a failure. This can be counter-intuitive for users. If we are unable to perform some maintenance on a node simply because it is down, then we should be capable of doing the maintenance when it comes back up without the need for user intervention.

Let's revisit the example of deploying that fourth node. If we are fault tolerant, then deploying node3 should succeed. There is still the issue of node2. When it comes back up we need to run the announce operation and repair operation on it. There is no reason that the user should have to do this manually. Now let's say that instead of node2 going down, node3 is the one that goes down. It crashes during bootstrap. In this case we obviously cannot continue with the deployment, but we can attempt to recover without user intervention. We can attempt to bootstrap the node again, and only after multiple failed attempts do we turn to user intervention.
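
A minimal sketch of that bounded auto-recovery follows. The bootstrap() method and the MAX_ATTEMPTS constant are illustrative placeholders; the idea is simply to retry a few times before asking the user to step in.

```java
// Hypothetical sketch of bounded auto-recovery: retry the bootstrap a few times
// before falling back to user intervention.
public class BootstrapRetrySketch {

    static final int MAX_ATTEMPTS = 3;

    static boolean bootstrap(String node, int attempt) {
        // Placeholder for the real bootstrap resource operation; here it
        // simulates the new node crashing during its first attempt.
        return attempt > 1;
    }

    public static void main(String[] args) {
        String newNode = "node3";
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            if (bootstrap(newNode, attempt)) {
                System.out.println(newNode + " bootstrapped on attempt " + attempt);
                return;
            }
            System.out.println("bootstrap attempt " + attempt + " failed, retrying");
        }
        // Only after repeated failures do we turn to user intervention.
        System.out.println(newNode + " failed to bootstrap after " + MAX_ATTEMPTS
            + " attempts; user intervention required");
    }
}
```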

Work Flow State

State is maintained in a couple of places. The operationMode and maintenancePending properties of the StorageNode class are used for tracking state. operationMode is also used for the Cluster Status column in the admin UI. This leads to confusion for users because when a cluster management task fails, the Cluster Status column for the node shows a value of DOWN. The user in turn thinks that the node is down when often that is not the case.

The second place state is maintained is StorageNodeOperationsHandlerBean, where it is essentially hard coded into the bean's methods. There are methods that correspond to each of the work flow states and transitions. This is a poor separation of concerns and has resulted in a lot of duplicated, brittle code.

Adding New Work Flows

This ties in closely with work flow state. As things currently stand, adding a new work flow requires adding a number of methods to StorageNodeOperationsHandlerBean in addition to any changes in the rhq-storage plugin. This code can and should be generalized so that adding or changing a work flow does not require big, invasive changes to our SLSB code.
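
As a rough illustration of what such a generalization could look like, the sketch below declares each work flow as an ordered list of named steps and runs them through one generic executor. All of the names are hypothetical; nothing here corresponds to existing RHQ code.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: work flows declared as data and executed by one generic
// engine, so adding a new work flow means adding a definition rather than new
// methods in the SLSB.
public class WorkflowRegistrySketch {

    static final Map<String, List<String>> WORKFLOWS = new LinkedHashMap<>();
    static {
        WORKFLOWS.put("deployNode", Arrays.asList("announce", "bootstrap", "schemaChange", "repair"));
        WORKFLOWS.put("undeployNode", Arrays.asList("decommission", "unannounce", "uninstall"));
        WORKFLOWS.put("weeklyRepair", Arrays.asList("repair"));
    }

    // One generic executor replaces the per-state, per-transition methods that
    // are currently hard coded.
    static void run(String workflowName) {
        for (String step : WORKFLOWS.get(workflowName)) {
            System.out.println(workflowName + ": executing step '" + step + "'");
            // dispatch the step to the appropriate resource operation or server action
        }
    }

    public static void main(String[] args) {
        run("deployNode");
    }
}
```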

Scheduling

We only want to allow a single work flow to execute at a time. Deploying multiple nodes simultaneously, for example, can lead to problems like schema disagreement. There is nothing in the current implementation to prevent simultaneous deployments, nor is there anything to prevent a user from adding or removing a node while the weekly, scheduled repair is running.
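
One possible way to serialize maintenance work flows is sketched below using an in-memory, single-threaded queue. This is only illustrative; in an RHQ installation with multiple servers, the serialization would need to be backed by something shared, such as a database lock.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch: a single-threaded queue ensures only one work flow
// (deploy, undeploy, repair, ...) runs at a time, so a second deployment or
// the weekly repair simply waits its turn.
public class MaintenanceQueueSketch {

    private static final ExecutorService MAINTENANCE_QUEUE = Executors.newSingleThreadExecutor();

    static void submit(String jobName, Runnable job) {
        MAINTENANCE_QUEUE.submit(() -> {
            System.out.println("starting " + jobName);
            job.run();
            System.out.println("finished " + jobName);
        });
    }

    public static void main(String[] args) {
        // Even if submitted concurrently, the jobs execute one after another.
        submit("deploy node3", () -> { });
        submit("weekly repair", () -> { });
        MAINTENANCE_QUEUE.shutdown();
    }
}
```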
